Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

due to gene expression, and that overrepresentation can be of biological importance rather than a bias. The overrepresented sequences report is a table that lists the overrepresented sequences with their counts, percentages, and possible sources. To save memory, only the first 200,000 reads in the FASTQ file are checked; the list is therefore not exhaustive, and other overrepresented sequences may escape detection. For each overrepresented sequence, FastQC searches a database of known contaminants and reports the best match that is at least 20 bases long and has no more than a single mismatch. A warning is issued if any sequence accounts for more than 0.1% of the total reads, and a failure is reported if any sequence accounts for more than 1% of the total. As shown in Figure 1.24, five overrepresented sequences were found: three are contaminating adaptors and two have no hits. The count and percentage indicate how significant each overrepresented sequence is. The first sequence in the table accounts for 29.4% of all reads in the FASTQ file. This sequence clearly originates from primer contamination and must be removed before analysis.
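
The logic behind the overrepresented sequences report can be illustrated with a minimal Python sketch. It is not FastQC's code, and FastQC's actual implementation may differ in its details. The function names, the `track_limit` parameter, and the row layout are hypothetical. The sketch assumes that a sequence is tracked only if it first appears within the first 200,000 reads, that tracked sequences keep being counted to the end of the file, and that a contaminant hit is an ungapped alignment of at least 20 bases with at most one mismatch. The 0.1% and 1% thresholds are taken from the text above.

```python
WARN_PCT = 0.1  # warning threshold (percent of total reads), from the text
FAIL_PCT = 1.0  # failure threshold (percent of total reads), from the text


def fastq_sequences(path):
    """Yield the sequence line of each record (assumes plain 4-line FASTQ records)."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()


def longest_match(a, b, max_mismatch=1):
    """Longest ungapped stretch of two equal-length strings with <= max_mismatch mismatches."""
    best = left = mismatches = 0
    for right in range(len(a)):
        if a[right] != b[right]:
            mismatches += 1
        # Shrink the window from the left until it is within the mismatch budget.
        while mismatches > max_mismatch:
            if a[left] != b[left]:
                mismatches -= 1
            left += 1
        best = max(best, right - left + 1)
    return best


def best_contaminant(seq, contaminants, min_len=20, max_mismatch=1):
    """Return (name, match_length) of the longest qualifying match, or None."""
    best = None
    for name, contam in contaminants.items():
        # offset = position in seq at which contam[0] is aligned (may be negative).
        for offset in range(-(len(contam) - 1), len(seq)):
            s_start, c_start = max(0, offset), max(0, -offset)
            n = min(len(seq) - s_start, len(contam) - c_start)
            length = longest_match(seq[s_start:s_start + n],
                                   contam[c_start:c_start + n], max_mismatch)
            if length >= min_len and (best is None or length > best[1]):
                best = (name, length)
    return best


def classify(pct):
    """Map a percentage of total reads to pass, warn, or fail."""
    if pct > FAIL_PCT:
        return "fail"
    if pct > WARN_PCT:
        return "warn"
    return "pass"


def overrepresented_report(seqs, contaminants, track_limit=200_000):
    """Return rows of (sequence, count, percentage, status, possible source)."""
    counts = {}
    total = 0
    for seq in seqs:
        total += 1
        if seq in counts:
            counts[seq] += 1
        elif total <= track_limit:  # new sequences are only registered early on
            counts[seq] = 1
    rows = []
    for seq, count in counts.items():
        pct = 100.0 * count / total
        status = classify(pct)
        if status != "pass":
            hit = best_contaminant(seq, contaminants)
            rows.append((seq, count, pct, status, hit[0] if hit else "No Hit"))
    rows.sort(key=lambda row: row[1], reverse=True)  # most frequent first
    return rows
```

With a FASTQ file and a dictionary of contaminant names and sequences at hand, `overrepresented_report(fastq_sequences("reads.fastq"), contaminants)` would return one row per flagged sequence, most frequent first. The file name here is a placeholder.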

1.5.11 Adapter Content

Full-length adaptor primers can form adaptor dimers, which may contaminate a significant number of reads. The adaptor content graph shows the cumulative percentage count of

FIGURE 1.23 Sequence duplication levels (warning and failure).

FIGURE 1.24 Overrepresented sequences.